When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs

Jeong, Soyeong, Jung, Taehee, Hwang, Sung Ju, Kim, Joo-Kyung, Kang, Dongyeop

arXiv.org Artificial Intelligence

Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, all necessary information directly. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches derived from prior problem-solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating their broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).


WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

Ning, Kangyun, Su, Yisong, Lv, Xueqiang, Zhang, Yuanzhe, Liu, Jian, Liu, Kang, Xu, Jinan

arXiv.org Artificial Intelligence

Although Large Language Models (LLMs) excel in NLP tasks, they still need external tools to extend their abilities. Current research on tool learning with LLMs often assumes mandatory tool use, which does not always align with real-world situations, where the necessity of tools is uncertain and incorrect or unnecessary tool use can damage the general abilities of LLMs. We therefore propose to explore whether LLMs can discern their ability boundaries and use tools flexibly. We introduce the Whether-or-not Tool Usage Evaluation benchmark (WTU-Eval) to assess LLMs on eleven datasets, six of which are tool-usage datasets and five of which are general datasets. LLMs are prompted to use tools according to their needs. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets, and that their performance on tool-usage datasets improves when their general ability approaches ChatGPT's. In both dataset types, incorrect tool usage significantly impairs performance. To mitigate this, we also develop a fine-tuning dataset to enhance tool decision-making. Fine-tuning Llama2-7B yields a 14% average performance improvement and a 16.8% decrease in incorrect tool usage. We will release the WTU-Eval benchmark.


What Happened When ChatGPT Got Hold of My Online Dating Profile - CNET

#artificialintelligence

For the record, I don't own socks with sloths on them. I have three pairs with the CNET logo on them. ChatGPT thinks I might, though, and it also thinks this fact could get me matches on Hinge, or Bumble, or any dating app that has the audacity to ask me for a random fact about myself. Click to read more Love Syncs. Here's a random fact about me: When I tested how ChatGPT might handle rewriting my dating app profile, the experimental AI chatbot tried to turn me into a cringey manic pixie dream girl who forgets to water her "jungle" of houseplants, dances to her favorite "tunes" and is looking for "a fellow weirdo" to go on *shudders* "adventures" with.



Will It Blend? Mixing Training Paradigms & Prompting for Argument Quality Prediction

van der Meer, Michiel, Reuver, Myrthe, Khurana, Urja, Krause, Lea, Santamaría, Selene Báez

arXiv.org Artificial Intelligence

This paper describes our contributions to the Shared Task of the 9th Workshop on Argument Mining (2022). Our approach uses Large Language Models for the task of Argument Quality Prediction. We perform prompt engineering using GPT-3, and also investigate three training paradigms: multi-task learning, contrastive learning, and intermediate-task training. We find that a mixed prediction setup outperforms single models. Prompting GPT-3 works best for predicting argument validity, while argument novelty is best estimated by a model trained using all three training paradigms.


SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning

Lee, Kimin, Laskin, Michael, Srinivas, Aravind, Abbeel, Pieter

arXiv.org Artificial Intelligence

Model-free deep reinforcement learning (RL) has been successful in a range of challenging domains. However, some issues remain, such as stabilizing the optimization of nonlinear function approximators, preventing error propagation due to the Bellman backup in Q-learning, and achieving efficient exploration. To mitigate these issues, we present SUNRISE, a simple unified ensemble method that is compatible with various off-policy RL algorithms. SUNRISE integrates three key ingredients: (a) bootstrap with random initialization, which improves the stability of the learning process by training a diverse ensemble of agents, (b) weighted Bellman backups, which prevent error propagation in Q-learning by reweighting sample transitions based on uncertainty estimates from the ensembles, and (c) an inference method that selects actions with the highest upper-confidence bounds for efficient exploration. Our experiments show that SUNRISE significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks in both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.


Variational Calibration of Computer Models

Marmin, Sébastien, Filippone, Maurizio

arXiv.org Machine Learning

Bayesian calibration of black-box computer models offers an established framework for obtaining a posterior distribution over model parameters. Traditional Bayesian calibration involves emulating the computer model and an additive model-discrepancy term using Gaussian processes; inference is then carried out using MCMC. These choices pose computational and statistical challenges and limitations, which we overcome by proposing the use of approximate Deep Gaussian processes and variational inference techniques. The result is a practical and scalable framework for calibration that achieves performance competitive with the state of the art.


Generalized Earthquake Frequency-Magnitude Distribution Described by Asymmetric Laplace Mixture Modelling

Mignan, Arnaud

arXiv.org Machine Learning

The complete part of the earthquake frequency-magnitude distribution (FMD), above the completeness magnitude mc, is well described by the Gutenberg-Richter law. The parameter mc, however, varies in space due to the seismic network configuration, yielding a convoluted FMD shape below max(mc). This paper investigates the shape of the generalized FMD (GFMD), which may be described as a mixture of elemental FMDs (eFMDs) defined as asymmetric Laplace distributions of mode mc [Mignan, 2012, https://doi.org/10.1029/2012JB009347]. An asymmetric Laplace mixture model (GFMD-ALMM) is thus proposed, with its parameters (detection parameter kappa, Gutenberg-Richter beta-value, mc distribution, as well as the number K and weights w of eFMD components) estimated using a semi-supervised hard expectation-maximization approach with BIC penalties for model complexity. The performance of the proposed method is analysed, with encouraging results: kappa, beta, and the mc distribution range are retrieved for different GFMD shapes in simulations, as well as in regional catalogues (southern and northern California, Nevada, Taiwan, France), in a global catalogue, and in an aftershock sequence (Christchurch, New Zealand). We find max(mc) to be conservative compared to other methods, and kappa = k/log(10) = 3 in most catalogues (compared to beta = b/log(10) = 1), but also that biases in kappa and beta may occur when rounding errors are present below completeness. The GFMD-ALMM, by modelling different FMD shapes in an autonomous manner, opens the door to new statistical analyses in the realm of incomplete seismicity data, which could in theory improve earthquake forecasting by considering roughly ten times more events.
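The elemental distribution described in the abstract can be sketched directly: an asymmetric Laplace density with mode mc rises exponentially with slope kappa below the mode (detection) and decays with slope beta above it (Gutenberg-Richter), and the GFMD is a weighted sum of such components. A minimal sketch follows; the function names are illustrative, and sharing a single kappa and beta across components is a simplifying assumption.

```python
import numpy as np

def ald_pdf(m, mc, kappa, beta):
    """Asymmetric Laplace density with mode mc: exponential rise with slope
    kappa below mc (detection part), exponential decay with slope beta above
    it (Gutenberg-Richter part). Normalized so it integrates to one."""
    m = np.asarray(m, dtype=float)
    norm = kappa * beta / (kappa + beta)
    return norm * np.where(m < mc,
                           np.exp(kappa * (m - mc)),
                           np.exp(-beta * (m - mc)))

def gfmd_pdf(m, modes, weights, kappa, beta):
    """Generalized FMD as a K-component mixture of elemental FMDs with
    component completeness magnitudes `modes` and weights `weights`
    (summing to one)."""
    return sum(w * ald_pdf(m, mc, kappa, beta)
               for w, mc in zip(weights, modes))
```

Fitting the mixture then amounts to estimating the modes, weights, kappa, and beta, which the paper does with a hard expectation-maximization scheme penalized by BIC.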